transformers.LineByLineTextDataset (deprecated)
Replaced by huggingface/datasets
This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library.
https://github.com/huggingface/transformers/blob/v4.18.0/src/transformers/data/datasets/language_modeling.py#L35-L38
legacy example https://github.com/huggingface/transformers/blob/v4.18.0/examples/legacy/run_language_modeling.py
Recommended: example using datasets https://github.com/huggingface/transformers/blob/v4.18.0/examples/pytorch/language-modeling/run_mlm.py
The FutureWarning message also points to this example
__len__ and __getitem__ are implemented
Each element is a dict {'input_ids': tensor([0, ..., 2])}
input_ids has dtype torch.int64 (token ids)
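The shape described above (a sequence with __len__ and __getitem__, one dict per non-empty line) can be illustrated with a dependency-free sketch. This is not the library's implementation: ToyTokenizer and LineByLineSketch are hypothetical names, the whitespace tokenizer is a stand-in for a real one, the lines are passed in memory instead of being read from a file, and plain Python lists stand in for torch.int64 tensors. The ids 0 and 2 are used as start/end markers only to mirror the example element above.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; ids 0 and 2 mark start and end."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, line: str, max_length: int) -> List[int]:
        ids = [0]
        for word in line.split():
            # New words get ids from 3 upward so 0 and 2 stay reserved.
            ids.append(self.vocab.setdefault(word, len(self.vocab) + 3))
        ids = ids[: max_length - 1]  # truncate, leaving room for the end marker
        ids.append(2)
        return ids


class LineByLineSketch:
    """Assumed stand-in: one {'input_ids': [...]} dict per non-empty line."""

    def __init__(self, tokenizer: ToyTokenizer, lines: List[str], block_size: int) -> None:
        # Skip empty and whitespace-only lines.
        kept = [line for line in lines if len(line) > 0 and not line.isspace()]
        self.examples = [{"input_ids": tokenizer.encode(line, block_size)} for line in kept]

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, i: int) -> Dict[str, List[int]]:
        return self.examples[i]


dataset = LineByLineSketch(ToyTokenizer(), ["hello world", "", "   ", "hello again"], block_size=8)
print(len(dataset))  # 2
print(dataset[0])    # {'input_ids': [0, 3, 4, 2]}
```

With the real class, each list would instead be a torch tensor of dtype torch.int64, and block_size would cap the tokenized length of each line.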